The data were obtained from the CDC Chronic Disease and Health Promotion Data & Indicators portal: https://chronicdata.cdc.gov/Heart-Disease-Stroke-Prevention/Heart-Disease-Mortality-Data-Among-US-Adults-35-by/i2vk-mgdh
Data variables included:
- Year: Center of 3-year average
- LocationAbbr: State, Territory, or US postal abbreviation
- LocationDesc: county name
- GeographicLevel: county/state
- DataSource
- Class: Cardiovascular Diseases
- Topic: Heart Disease Mortality
- Data_Value: heart disease death rate
- Data_Value_Unit: per 100,000 population
- Data_Value_Type: Age-adjusted, Spatially Smoothed, 3-year Average Rate
- Data_Value_Footnote_Symbol
- Data_Value_Footnote
- StratificationCategory1: gender
- Stratification1: gender categories
- StratificationCategory2: race
- Stratification2: race categories (White, Black, Hispanic, Asian and Pacific Islander, American Indian and Alaskan Native)
- TopicID
- LocationID
- FIPS code
- Location 1: latitude and longitude
# load R packages
library(gsubfn)
## Loading required package: proto
## Could not load tcltk. Will use slower R code instead.
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dtplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(leaflet)
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(raster)
## Loading required package: sp
##
## Attaching package: 'raster'
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:data.table':
##
## shift
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:raster':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(rjson)
# download and read in the data
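if (!file.exists("Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County.csv")) {
  options(timeout = 60)  # download.file() takes its timeout from this option
  download.file("https://chronicdata.cdc.gov/api/views/i2vk-mgdh/rows.csv?accessType=DOWNLOAD",
                destfile = "Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County.csv",
                method = "libcurl"
  )
}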
heartdisease <- data.table::fread("Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County.csv")
# check dimensions and whether NAs exist
knitr::kable(dim(heartdisease))
knitr::kable(summary(is.na(heartdisease)))
| Column | Mode | FALSE | TRUE |
|---|---|---|---|
| Data_Value | logical | 32149 | 26927 |
| Each of the other 18 columns | logical | 59076 | none |
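A more compact way to get the same information is to count the NAs per column directly. The sketch below uses a small made-up data frame (`toy` is hypothetical and does not contain CDC values) to illustrate the idea:

```r
# Toy data standing in for the CDC table (values are made up for illustration)
toy <- data.frame(
  LocationAbbr = c("CA", "CA", "NY"),
  Data_Value   = c(250.1, NA, 310.4)
)

# Count missing values in each column; non-zero entries flag columns with NAs
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the real table, the same call would return one count per column instead of a three-row summary for each.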
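Based on the summary table, only Data_Value contained NAs, which marked records with insufficient data. I replaced these NAs with 0 for later convenience, so a value of 0 in later steps indicates insufficient data rather than a true rate of zero.
# replace NAs with 0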
heartdisease$Data_Value <- heartdisease$Data_Value %>% replace_na(0)
After the replacement, the summary table showed no remaining NAs.
Based on the main question, county-level data for California were selected.
# select county-level data in California
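One way to confirm this without re-reading the full summary table is to test the column directly. A minimal sketch on a made-up vector (`rates` is hypothetical), using the same `replace_na()` call as in the step before:

```r
library(tidyr)

# Made-up rates with one missing entry (not real CDC values)
rates <- c(250.1, NA, 310.4)

# Same replacement as in the report: NA becomes 0
rates_filled <- replace_na(rates, 0)

# anyNA() returns FALSE once no missing values remain
anyNA(rates_filled)
```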
heartdisease_CA <- heartdisease[LocationAbbr == 'CA' & GeographicLevel == 'County']
The Location 1 column contained both latitude and longitude, so it was more convenient to separate them into two columns.
# remove "()" in strings
heartdisease_CA$`Location 1` <- gsub("[()]", "", heartdisease_CA$`Location 1`)
# separate lat and lon variables
heartdisease_CA <- heartdisease_CA %>%
separate(col = 'Location 1', into=c('lat', 'lon'), sep=',')
Data_Value, lat, and lon were then converted to numeric class.
# convert chr to num
heartdisease_CA$Data_Value <- as.numeric(heartdisease_CA$Data_Value)
heartdisease_CA$lat <- as.numeric(heartdisease_CA$lat)
heartdisease_CA$lon <- as.numeric(heartdisease_CA$lon)
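As an alternative sketch, the splitting and the numeric conversion can be combined, since `separate()` accepts `convert = TRUE`. The coordinates below are made up, `toy` is a hypothetical stand-in for `heartdisease_CA`, and the `"(lat, lon)"` string layout is an assumption based on the cleaning steps described here:

```r
library(tidyr)

# Made-up rows mimicking the assumed "(lat, lon)" layout of `Location 1`
toy <- data.frame(
  LocationDesc = c("County A", "County B"),
  `Location 1` = c("(34.05, -118.24)", "(37.77, -122.42)"),
  check.names = FALSE
)

# Strip the parentheses, as in the report
toy$`Location 1` <- gsub("[()]", "", toy$`Location 1`)

# Split on the comma (plus any following spaces) and convert to numeric in one step
toy <- separate(toy, col = "Location 1", into = c("lat", "lon"),
                sep = ",\\s*", convert = TRUE)
str(toy)
```

Splitting on `",\\s*"` also drops the space after the comma, so no whitespace is left in the `lon` strings before conversion.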
CA_gender contained the heart disease mortality data stratified by gender, CA_race contained the data stratified by race, and CA_overall contained the data without any stratification.
# select data under each stratification
CA_gender <- heartdisease_CA[Stratification1 != 'Overall' & Stratification2 == 'Overall']
CA_race <- heartdisease_CA[Stratification2 != 'Overall' & Stratification1 == 'Overall']
CA_overall <- heartdisease_CA[Stratification1 == 'Overall' & Stratification2 == 'Overall']
Since California has 58 counties in total, the county-level subset appeared reasonable.
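This check can be made explicit by counting the distinct county names in the subset. The sketch below uses a made-up table (`toy_CA` and its values are hypothetical). On the real data the same call would be applied to `heartdisease_CA$LocationDesc` and compared against 58:

```r
library(data.table)

# Made-up subset: two counties, one of them with two strata
toy_CA <- data.table(
  LocationDesc    = c("County A", "County A", "County B"),
  Stratification1 = c("Overall", "Male", "Overall")
)

# Number of distinct counties, regardless of how many strata each has
n_counties <- uniqueN(toy_CA$LocationDesc)
n_counties
```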